Mercedes-Benz Greener Manufacturing

Project 1

DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario: Since the first automobile, the Benz Patent Motor Car of 1886, Mercedes-Benz has stood for important automotive innovations, including the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2,000 patents per year, making it the European leader among premium carmakers. With a huge selection of features and options, customers can configure the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, Mercedes-Benz treats safety and efficiency as paramount on its production lines. However, optimizing the speed of this testing system across so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. An optimal algorithm will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

The following actions should be performed:

1. If the variance of any column is zero, remove that column.
2. Check for null and unique values in the test and train sets.
3. Apply a label encoder to the categorical columns.
4. Perform dimensionality reduction.
5. Predict the test_df values using XGBoost.
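The steps above can be sketched end to end before working through them cell by cell. This is a minimal sketch on a tiny stand-in frame, not the real train.csv; the column names, values, and the choice of 2 PCA components are illustrative assumptions, and the final XGBoost fit is left as a comment so the sketch stays dependency-light:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA

# Toy stand-in for train.csv: one categorical column, two binary
# columns (one of them constant, i.e. zero variance), and a target y.
df = pd.DataFrame({
    'X0':  ['k', 'az', 'k', 't'],
    'X10': [0, 1, 0, 1],
    'X11': [0, 0, 0, 0],          # zero variance -> should be dropped
    'y':   [130.81, 88.53, 76.26, 80.62],
})

# 1. Drop zero-variance columns (constant columns carry no signal).
features = df.drop(columns='y')
constant = [c for c in features.columns if features[c].nunique() == 1]
features = features.drop(columns=constant)

# 2. Label-encode the object (categorical) columns.
for col in features.select_dtypes(include='object').columns:
    features[col] = LabelEncoder().fit_transform(features[col])

# 3. Dimensionality reduction with PCA.
pca = PCA(n_components=2)
reduced = pca.fit_transform(features)

# 4. Fit XGBoost on the reduced features, e.g.:
#    from xgboost import XGBRegressor
#    XGBRegressor().fit(reduced, df['y'])
print(constant, reduced.shape)
```

Each step is revisited on the real data in the cells that follow.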

In [107]:
import os
os.getcwd()
Out[107]:
'D:\\Data Science\\SL Downloads\\dataset\\Merc Benz'
In [108]:
os.chdir(r'D:\Data Science\SL Downloads\dataset\Merc Benz')  # raw string avoids backslash-escape issues
In [8]:
os.getcwd()
Out[8]:
'D:\\Data Science\\SL Downloads\\dataset\\Merc Benz'
In [109]:
import pandas as pd
rawdata=pd.read_csv('./train/train.csv')
In [10]:
rawdata.head(5)
Out[10]:
ID y X0 X1 X2 X3 X4 X5 X6 X8 ... X375 X376 X377 X378 X379 X380 X382 X383 X384 X385
0 0 130.81 k v at a d u j o ... 0 0 1 0 0 0 0 0 0 0
1 6 88.53 k t av e d y l o ... 1 0 0 0 0 0 0 0 0 0
2 7 76.26 az w n c d x j x ... 0 0 0 0 0 0 1 0 0 0
3 9 80.62 az t n f d x l e ... 0 0 0 0 0 0 0 0 0 0
4 13 78.02 az v n f d h d n ... 0 0 0 0 0 0 0 0 0 0

5 rows × 378 columns

In [11]:
# Count of int, float, and object columns

Types = rawdata.dtypes.reset_index()
Types.columns = ["Count", "Column Type"]
Types.groupby("Column Type").count()
Out[11]:
Count
Column Type
int64 369
float64 1
object 8

There are 8 categorical (object) columns; the remaining 370 are numeric.
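Those 8 object columns are the ones that will need label encoding later, so it helps to capture them programmatically. A minimal sketch on a small stand-in frame (column names and values are illustrative, not the real data):

```python
import pandas as pd

# Small stand-in frame with the same mix of dtypes seen above
# (object, int64, float64).
df = pd.DataFrame({
    'X0':  ['k', 'az', 't'],       # object
    'X10': [0, 1, 0],              # int64
    'y':   [130.81, 88.53, 76.26], # float64
})

# select_dtypes splits the frame by column type, which makes it easy
# to apply a label encoder to just the categorical columns later.
cat_cols = df.select_dtypes(include='object').columns.tolist()
num_cols = df.select_dtypes(exclude='object').columns.tolist()
print(cat_cols, num_cols)
```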

In [12]:
rawdata.shape
Out[12]:
(4209, 378)
In [13]:
rawdata.describe()
Out[13]:
ID y X10 X11 X12 X13 X14 X15 X16 X17 ... X375 X376 X377 X378 X379 X380 X382 X383 X384 X385
count 4209.000000 4209.000000 4209.000000 4209.0 4209.000000 4209.000000 4209.000000 4209.000000 4209.000000 4209.000000 ... 4209.000000 4209.000000 4209.000000 4209.000000 4209.000000 4209.000000 4209.000000 4209.000000 4209.000000 4209.000000
mean 4205.960798 100.669318 0.013305 0.0 0.075077 0.057971 0.428130 0.000475 0.002613 0.007603 ... 0.318841 0.057258 0.314802 0.020670 0.009503 0.008078 0.007603 0.001663 0.000475 0.001426
std 2437.608688 12.679381 0.114590 0.0 0.263547 0.233716 0.494867 0.021796 0.051061 0.086872 ... 0.466082 0.232363 0.464492 0.142294 0.097033 0.089524 0.086872 0.040752 0.021796 0.037734
min 0.000000 72.110000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2095.000000 90.820000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 4220.000000 99.150000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 6314.000000 109.010000 0.000000 0.0 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 ... 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 8417.000000 265.320000 1.000000 0.0 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 370 columns

In [14]:
rawdata.isnull().sum()
Out[14]:
ID      0
y       0
X0      0
X1      0
X2      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 378, dtype: int64
In [15]:
import seaborn as sns
sns.heatmap(rawdata.isnull())
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b03e141388>

It looks like there are no null values in any column.
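The heatmap reading can also be confirmed without a plot. A one-liner sketch on a stand-in frame (the real check would run on rawdata):

```python
import pandas as pd

df = pd.DataFrame({'X0': ['k', 'az'], 'y': [130.81, 88.53]})

# Total null count across the whole frame; 0 confirms there is
# nothing to impute or drop.
total_nulls = df.isnull().sum().sum()
print(total_nulls)
```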

Checking unique values in each column

In [128]:
# rawdata['X0'].unique()

for i in rawdata.drop(['ID', 'y'], axis=1):
    print('Unique elements in.. ' + i + '-column')
    print(rawdata[i].unique())
    print('-----------------------------------------')
Unique elements in.. X0-column
['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab']
-----------------------------------------
Unique elements in.. X1-column
['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab']
-----------------------------------------
Unique elements in.. X2-column
['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']
-----------------------------------------
Unique elements in.. X3-column
['a' 'e' 'c' 'f' 'd' 'b' 'g']
-----------------------------------------
Unique elements in.. X4-column
['d' 'b' 'c' 'a']
-----------------------------------------
Unique elements in.. X5-column
['u' 'y' 'x' 'h' 'g' 'f' 'j' 'i' 'd' 'c' 'af' 'ag' 'ab' 'ac' 'ad' 'ae'
 'ah' 'l' 'k' 'n' 'm' 'p' 'q' 's' 'r' 'v' 'w' 'o' 'aa']
-----------------------------------------
Unique elements in.. X6-column
['j' 'l' 'd' 'h' 'i' 'a' 'g' 'c' 'k' 'e' 'f' 'b']
-----------------------------------------
Unique elements in.. X8-column
['o' 'x' 'e' 'n' 's' 'a' 'h' 'p' 'm' 'k' 'd' 'i' 'v' 'j' 'b' 'q' 'w' 'g'
 'y' 'l' 'f' 'u' 'r' 't' 'c']
-----------------------------------------
Unique elements in the remaining columns (X10 through X357; output condensed): every one of these is a binary column containing [0 1], except X11, X93, X107, X233, X235, X268, X289, X290, X293, X297, X330 and X347, which contain only [0] — i.e. zero-variance columns that must be removed. (Output truncated at X357.)
[0 1]
-----------------------------------------
Unique elements in.. X358-column
[0 1]
-----------------------------------------
Unique elements in.. X359-column
[0 1]
-----------------------------------------
Unique elements in.. X360-column
[0 1]
-----------------------------------------
Unique elements in.. X361-column
[1 0]
-----------------------------------------
Unique elements in.. X362-column
[0 1]
-----------------------------------------
Unique elements in.. X363-column
[0 1]
-----------------------------------------
Unique elements in.. X364-column
[0 1]
-----------------------------------------
Unique elements in.. X365-column
[0 1]
-----------------------------------------
Unique elements in.. X366-column
[0 1]
-----------------------------------------
Unique elements in.. X367-column
[0 1]
-----------------------------------------
Unique elements in.. X368-column
[0 1]
-----------------------------------------
Unique elements in.. X369-column
[0 1]
-----------------------------------------
Unique elements in.. X370-column
[0 1]
-----------------------------------------
Unique elements in.. X371-column
[0 1]
-----------------------------------------
Unique elements in.. X372-column
[0 1]
-----------------------------------------
Unique elements in.. X373-column
[0 1]
-----------------------------------------
Unique elements in.. X374-column
[0 1]
-----------------------------------------
Unique elements in.. X375-column
[0 1]
-----------------------------------------
Unique elements in.. X376-column
[0 1]
-----------------------------------------
Unique elements in.. X377-column
[1 0]
-----------------------------------------
Unique elements in.. X378-column
[0 1]
-----------------------------------------
Unique elements in.. X379-column
[0 1]
-----------------------------------------
Unique elements in.. X380-column
[0 1]
-----------------------------------------
Unique elements in.. X382-column
[0 1]
-----------------------------------------
Unique elements in.. X383-column
[0 1]
-----------------------------------------
Unique elements in.. X384-column
[0 1]
-----------------------------------------
Unique elements in.. X385-column
[0 1]
-----------------------------------------
In [11]:
import matplotlib
In [129]:
import pandas_profiling as pp
pp.ProfileReport(rawdata)
Out[129]:

Observations

Number of observations: 4209

Number of variables: 378

Variable types: BOOL 368, CAT 8, NUM 2

Null values: none

Target y: 2545 distinct values

X11 and a few other columns can be neglected because all of their rows hold the same value. Many columns are highly correlated with one another.
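The zero-variance removal called for in the problem statement can be sketched directly in pandas (a toy DataFrame with hypothetical column names, not the actual dataset):

```python
import pandas as pd

# Toy frame: 'const' stands in for columns like X11 that hold a single value
df = pd.DataFrame({'keep': [0, 1, 0, 1], 'const': [0, 0, 0, 0]})

# A column with one unique value has zero variance and can be dropped
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)  # ['const']
```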

In [14]:
rawdata.head()
Out[14]:
ID y X0 X1 X2 X3 X4 X5 X6 X8 ... X375 X376 X377 X378 X379 X380 X382 X383 X384 X385
0 0 130.81 k v at a d u j o ... 0 0 1 0 0 0 0 0 0 0
1 6 88.53 k t av e d y l o ... 1 0 0 0 0 0 0 0 0 0
2 7 76.26 az w n c d x j x ... 0 0 0 0 0 0 1 0 0 0
3 9 80.62 az t n f d x l e ... 0 0 0 0 0 0 0 0 0 0
4 13 78.02 az v n f d h d n ... 0 0 0 0 0 0 0 0 0 0

5 rows × 378 columns

In [16]:
## ANOVA check on categorical columns

import statsmodels.api as sm
from statsmodels.formula.api import ols

col = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
for i in col:
    model = ols('y ~ ' + i , data=rawdata).fit()
    
    print('Column : {}, F-statistic : {}, p-value : {}'.format(i, model.fvalue, model.f_pvalue))
Column : X0, F-statistic : 122.31407564900334, p-value : 0.0
Column : X1, F-statistic : 6.988434069498696, p-value : 1.1280321132760776e-24
Column : X2, F-statistic : 28.256994858808234, p-value : 1.9306837593615617e-196
Column : X3, F-statistic : 30.991746795319916, p-value : 1.2512325123725502e-36
Column : X4, F-statistic : 2.6188965213725144, p-value : 0.04920919630464415
Column : X5, F-statistic : 2.152702885496953, p-value : 0.0004035846965797788
Column : X6, F-statistic : 4.1750460361125, p-value : 3.6159365940001084e-06
Column : X8, F-statistic : 5.030918412130861, p-value : 1.2692541091299133e-14

Columns with high F-statistics should be kept. X4's p-value sits right at the 0.05 threshold, so its effect on the target is marginal at best, and the column can be discarded.

In [18]:
Types = rawdata.dtypes.reset_index()
Types.columns = ["Count", "Column Type"]
Types.groupby("Column Type").count()

# Numeric columns
numeric=Types[Types["Column Type"]=='int64'].Count
numeric
rawdata[numeric]
Out[18]:
ID X10 X11 X12 X13 X14 X15 X16 X17 X18 ... X375 X376 X377 X378 X379 X380 X382 X383 X384 X385
0 0 0 0 0 1 0 0 0 0 1 ... 0 0 1 0 0 0 0 0 0 0
1 6 0 0 0 0 0 0 0 0 1 ... 1 0 0 0 0 0 0 0 0 0
2 7 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 1 0 0 0
3 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 13 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4204 8405 0 0 0 0 1 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4205 8406 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
4206 8412 0 0 1 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
4207 8415 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4208 8417 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0

4209 rows × 369 columns

F regression on Numeric columns

In [19]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

ftr = rawdata[numeric].drop(['ID'],axis=1)
trgt= rawdata.y

fs= SelectKBest(f_regression, k="all")
fs.fit(ftr,trgt)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_selection\_univariate_selection.py:299: RuntimeWarning: invalid value encountered in true_divide
  corr /= X_norms
C:\Users\dell\AppData\Roaming\Python\Python37\site-packages\scipy\stats\_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater
  return (a < x) & (x < b)
C:\Users\dell\AppData\Roaming\Python\Python37\site-packages\scipy\stats\_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less
  return (a < x) & (x < b)
C:\Users\dell\AppData\Roaming\Python\Python37\site-packages\scipy\stats\_distn_infrastructure.py:1912: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= _a)
Out[19]:
SelectKBest(k='all', score_func=<function f_regression at 0x000001B03FBB6F78>)
In [20]:
scores  = list(fs.scores_)
pvalues = list(fs.pvalues_)
fcols= list(ftr.columns)

scores[0:5]
pvalues[0:5]
fcols[0:5]
Out[20]:
['X10', 'X11', 'X12', 'X13', 'X14']
In [21]:
# List of tuples with feature and their importance
table = [(col, score, round (pvalue,4))  for col, score, pvalue in zip(fcols, scores, pvalues)]
print(table[0:5])
[('X10', 3.065618674114047, 0.08), ('X11', nan, nan), ('X12', 34.19472628005719, 0.0), ('X13', 9.82760413138798, 0.0017), ('X14', 163.8983133169353, 0.0)]
In [22]:
## sort the (column, F-score, p-value) tuples by p-value, descending
table= sorted(table, key = lambda x: x[2], reverse = True)
print(table[0:5])
[('X11', nan, nan), ('X40', 0.003548743596460469, 0.9525), ('X32', 0.012882504616845423, 0.9096), ('X18', 0.01345775472717495, 0.9077), ('X92', 0.04617602698035259, 0.8299)]
In [25]:
#Lets put them in a dataframe

newdf= pd.DataFrame(table, columns = ['Colname', 'fscore', 'pvalue'])
newdf

# There are some NAN values too
Out[25]:
Colname fscore pvalue
0 X11 NaN NaN
1 X40 0.003549 0.9525
2 X32 0.012883 0.9096
3 X18 0.013458 0.9077
4 X92 0.046176 0.8299
... ... ... ...
363 X371 204.639375 0.0000
364 X376 55.399436 0.0000
365 X378 301.699496 0.0000
366 X379 19.496687 0.0000
367 X382 110.266258 0.0000

368 rows × 3 columns

In [26]:
# Null hypothesis: the column has no effect on the target.
# If p-value < 0.05 we reject the null hypothesis; otherwise we fail to reject it.
# So columns with p-values greater than 0.05 show no detectable effect on the
# target, and those columns can be dropped.

dropcols= newdf[newdf['pvalue'] > 0.05]
#dropcols=dropcols['Colname']
dropcols=dropcols.Colname.values
dropcols
Out[26]:
array(['X40', 'X32', 'X18', 'X92', 'X24', 'X42', 'X83', 'X103', 'X49',
       'X89', 'X86', 'X38', 'X41', 'X87', 'X74', 'X33', 'X39', 'X36',
       'X26', 'X70', 'X59', 'X60', 'X58', 'X57', 'X15', 'X95', 'X73',
       'X10', 'X63', 'X65', 'X67', 'X210', 'X207', 'X257', 'X258', 'X230',
       'X254', 'X266', 'X200', 'X206', 'X248', 'X220', 'X213', 'X240',
       'X245', 'X203', 'X226', 'X288', 'X262', 'X259', 'X280', 'X253',
       'X260', 'X246', 'X340', 'X294', 'X175', 'X292', 'X296', 'X364',
       'X365', 'X332', 'X123', 'X366', 'X338', 'X145', 'X182', 'X139',
       'X384', 'X114', 'X117', 'X105', 'X168', 'X129', 'X164', 'X196',
       'X184', 'X181', 'X190', 'X192', 'X124', 'X153', 'X345', 'X295',
       'X319', 'X359', 'X186', 'X194', 'X369', 'X374', 'X357', 'X356',
       'X138', 'X146', 'X160', 'X358', 'X140', 'X173', 'X353', 'X385',
       'X324', 'X361', 'X133', 'X195', 'X104', 'X323', 'X161', 'X375',
       'X307', 'X143', 'X152', 'X326', 'X141', 'X318'], dtype=object)
In [27]:
# Lets check Nan values
newdf.isnull().sum()
# 12 columns have Null values
newdf[newdf['fscore'].isnull()]
Out[27]:
Colname fscore pvalue
0 X11 NaN NaN
8 X93 NaN NaN
92 X107 NaN NaN
105 X233 NaN NaN
106 X235 NaN NaN
144 X268 NaN NaN
182 X289 NaN NaN
183 X290 NaN NaN
185 X293 NaN NaN
191 X297 NaN NaN
192 X330 NaN NaN
193 X347 NaN NaN
In [28]:
rawdata['X11']
#rawdata['X11'].unique()
rawdata['X11'].sum()
## These columns all have only one value '0', also highlighted by Pandas profiling
## These columns can also be ignored.
Out[28]:
0
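The NaN F-scores (and the RuntimeWarnings from f_regression above) trace back to these constant columns: the Pearson correlation divides by the column norm, which is zero for a zero-variance column. A minimal numpy sketch of the failure mode, with toy data:

```python
import numpy as np

x = np.zeros(4)                      # constant column, like X11
y = np.array([1.0, 2.0, 1.5, 2.5])   # toy target values

x_c = x - x.mean()
y_c = y - y.mean()
with np.errstate(invalid='ignore'):  # suppress the 0/0 RuntimeWarning
    corr = (x_c @ y_c) / (np.linalg.norm(x_c) * np.linalg.norm(y_c))
print(np.isnan(corr))  # True: the statistic is undefined for constant columns
```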

Now let's try machine-learning models for regression

  1. First, without discarding any columns, let's check baseline performance
In [29]:
#One hot encoding

import pandas as pd
rawdata=pd.get_dummies(rawdata)
In [31]:
rawdata
Out[31]:
ID y X10 X11 X12 X13 X14 X15 X16 X17 ... X8_p X8_q X8_r X8_s X8_t X8_u X8_v X8_w X8_x X8_y
0 0 130.81 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 6 88.53 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 7 76.26 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 1 0
3 9 80.62 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 13 78.02 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4204 8405 107.39 0 0 0 0 1 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
4205 8406 108.77 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4206 8412 109.22 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4207 8415 87.48 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4208 8417 110.85 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

4209 rows × 565 columns

In [30]:
# Collecting X and Y
X = rawdata.drop(['ID','y'],axis=1).values
Y = rawdata['y'].values
In [200]:
X
Out[200]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]], dtype=int64)
In [201]:
Y
Out[201]:
array([130.81,  88.53,  76.26, ..., 109.22,  87.48, 110.85])
In [33]:
# Splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, test_size=0.3)
In [34]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
(2946, 563)
(2946,)
(1263, 563)
(1263,)

Linear Regression

In [35]:
# import the ML algorithm
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, Y_train)
Out[35]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [191]:
linreg.coef_
Out[191]:
array([ 6.38201729e+09,  2.52930663e+13, -6.21807830e+12,  1.26166743e+12,
       -6.21807830e+12, -8.42227925e+12, -7.20543087e+12, -1.18912553e+13,
       -6.21807830e+12, -6.21807830e+12, -6.21807830e+12, -3.17231637e+13,
       -6.21807830e+12, -8.91930923e+13,  5.99955663e+12,  4.74955535e+13,
        5.25944233e-02, -3.70612945e+13,  1.20189075e+14, -3.97024793e+12,
       -7.65928169e+13,  3.03615617e+13, -4.82158587e+13,  6.38201729e+09,
        8.19696818e+13, -8.79691189e+12, -5.37686482e+12,  3.84900331e-01,
        1.17351531e+11,  4.89890727e+12,  4.89890727e+12,  4.76025271e+00,
       -1.96779573e+13, -2.11252151e+12, -7.92321777e+00,  3.75000000e-01,
        5.19995117e+00,  1.57234763e+11,  3.30883789e+00, -5.82519531e-01,
        2.71240234e-01, -1.85626865e+12,  8.01052120e+12, -1.25531952e+13,
       -1.34082031e+00,  3.00292969e-01, -5.13769531e+00, -2.41617579e+13,
        4.81287771e+12, -8.66353489e+12, -1.08688100e+13, -1.46498207e+13,
       -1.51517029e+12, -2.14843750e-02, -3.76098633e+00, -5.56639123e+12,
       -3.44998902e+12, -1.26953125e+00, -1.52539062e+00, -1.55468750e+00,
       -2.16326859e+13,  2.86718750e+00,  1.95087441e+13, -2.93408203e+00,
       -1.18434933e+13,  4.89890727e+12, -9.91455078e-01, -7.46093750e-01,
        4.20470049e+13, -2.21093750e+00,  1.56250000e-01, -6.08202007e+12,
        1.60356374e+13,  1.21777344e+00, -7.18528796e+12,  2.79862921e+12,
        6.53253330e+12, -5.33099966e+12, -2.64101263e+12, -1.41687012e+00,
        1.52463735e+13,  1.53077381e+12, -6.98862701e+12, -1.34788649e+12,
       -1.70278405e+13,  1.44907729e+13,  8.57317477e+12,  5.34265915e+12,
       -1.73828125e+00,  3.24002428e+12,  3.73486004e+12, -3.80468750e+00,
        2.80908203e+00, -7.67776695e+12, -7.47314453e-01,  1.00969433e+12,
       -5.85362148e+12,  4.89890727e+12,  4.89890727e+12, -1.31598017e+13,
       -3.19418986e+12, -5.74632658e+12,  3.51562500e-01, -1.72059213e+12,
       -3.75000000e-01, -3.08172869e+12, -1.08245971e+13, -5.13990004e+12,
        9.01254135e+12, -4.59858280e+11,  1.19055678e+13,  4.32748413e+00,
        1.30222195e+12,  2.53567355e+12, -1.29626932e+13,  2.15475799e+13,
       -2.62036133e+00, -5.70639791e+12,  1.19055678e+13,  1.19055678e+13,
        1.19055678e+13,  3.48431767e+12,  1.36113655e+12,  5.87981618e+12,
        2.21099204e+13,  5.54918774e+12,  1.58744766e+13, -1.55443462e+13,
        1.53808594e-01,  1.04855697e+12,  2.85872663e+12,  1.58744766e+13,
        1.26166743e+12,  4.87070809e+12, -3.23498687e+12, -1.22851562e+00,
       -8.39843750e-01,  6.13281250e-01,  3.03535714e+12,  7.81738281e-01,
        1.07473310e+13,  1.69191051e+12,  7.69220851e+12,  7.69220851e+12,
        1.04855697e+12, -5.84179688e+00,  4.89890727e+12, -6.63867188e+00,
       -1.53393558e+12, -1.34570312e+00,  1.26166743e+12, -9.88265894e+12,
       -3.72373549e+12, -1.36063944e+13, -2.53186023e+12, -3.72373549e+12,
       -9.88265894e+12, -2.53186023e+12, -5.93275949e+12,  1.07473310e+13,
       -7.63249705e+12,  1.26166743e+12, -4.89890727e+12, -4.89890727e+12,
       -2.03710938e+01, -2.41718750e+01,  1.19055678e+13,  1.26166743e+12,
       -1.08593750e+00, -1.82183250e+13, -3.47131270e+12, -1.51517029e+12,
       -1.13745096e+13, -4.35244288e+12, -1.53539774e+12, -5.89489746e+00,
        3.05761719e+00, -1.61132812e-01, -7.00203118e+12,  1.21251574e+12,
        1.21251574e+12,  9.75509615e+12,  1.13762060e+12, -1.62140701e+10,
        1.07473310e+13,  1.07473310e+13, -3.71132743e+12,  1.26166743e+12,
        2.84656419e+12,  3.56016313e+12, -3.95800781e+00,  6.01913322e+10,
        6.01913322e+10, -1.30157500e+13,  1.13515344e+12, -7.01063802e+12,
        2.09375000e+00, -1.15851460e+13, -7.27696573e+12, -5.30917348e+12,
       -3.96590453e+12,  1.14819822e+12, -8.59375000e-02, -2.59179688e+00,
       -3.28125000e-01, -1.54296875e+00,  7.14851421e+11, -1.54882812e+00,
        1.04101562e+00,  9.76562500e-04, -4.53179287e+12, -4.24124914e+12,
       -1.92171996e+11, -1.92171996e+11, -2.58988830e+11, -1.51433867e+12,
       -1.79408320e+13,  2.66960825e+12,  5.70312500e-01, -3.00283893e+12,
        1.14625880e+13,  1.80152192e+12,  7.80690721e+12, -2.58244465e+12,
        4.94596919e+11, -2.23144531e+00, -2.21389706e+12, -1.43855371e+12,
        5.59704841e+12,  2.34019846e+12, -2.14843750e-02,  3.71132743e+12,
        3.41179337e+12,  1.99688270e+11, -1.26404746e+14, -6.43329957e+13,
       -1.32493189e+11, -1.23049078e+12,  2.58988830e+11, -1.28959537e+13,
        1.11572266e+00, -2.65076588e+12, -9.07264196e+11, -9.07264196e+11,
       -2.97066979e+12,  5.09964794e+12, -3.62401878e+12,  6.19592649e+12,
       -1.19488958e+13,  2.09330922e+13, -4.49160719e+12,  4.28881836e+00,
        7.02237432e+11,  2.21529587e+12, -2.45953369e+00,  2.21529587e+12,
        3.59033203e+00, -1.54296875e-01,  3.44824219e+00,  7.14843750e-01,
        2.21529587e+12, -7.71334568e+12, -5.35032398e+11, -1.87044248e+13,
        6.68942733e+12, -1.14355453e+13,  5.59229522e+12,  1.21251574e+12,
        1.21251574e+12,  1.21251574e+12,  1.21251574e+12,  9.75509615e+12,
       -9.51904297e-01,  3.84576582e+11,  2.52420167e+12, -2.10351562e+00,
       -3.33300781e+00, -1.79314923e+12,  4.10156250e-01,  1.48114747e+11,
       -1.48114747e+11, -4.12606057e+09, -2.16527909e+12,  2.16527909e+12,
       -1.48388672e+00, -1.61621094e-01,  4.03145863e+12,  4.29583061e+13,
        1.08452430e+13,  1.08452430e+13, -1.95759347e+13, -1.53464731e+12,
       -3.17052873e+12, -5.25767107e+12,  9.79756955e+12, -4.58398438e+00,
        4.92342597e+12, -1.11332293e+13, -2.22968474e+12,  6.90998816e+12,
       -5.21273351e+12, -8.58337402e-01, -1.26513672e+01,  5.18869756e+11,
       -2.27343750e+00, -1.49830951e+12, -2.60326184e+12,  2.05183746e+12,
        3.42340088e+00,  1.49643573e+12,  1.12500000e+00,  4.06567383e+00,
       -1.46875000e+00, -6.38320129e+11, -2.97028290e+12,  3.05029297e+00,
        1.07473310e+13, -7.63249705e+12, -7.63249705e+12,  1.07473310e+13,
       -7.63249705e+12,  1.07473310e+13,  1.07473310e+13,  1.26166743e+12,
        1.07473310e+13,  1.07473310e+13,  3.08007812e+00,  1.07473310e+13,
        1.07473310e+13,  1.26166743e+12, -7.16200726e+11, -1.70728486e+13,
        7.83203125e-01, -4.02343750e-01,  1.52734375e+00, -6.25000000e+00,
        2.23876953e+00,  7.92968750e-01,  4.93164062e-02,  2.51953125e+00,
       -4.84838867e+00, -1.11328125e-01, -9.72656250e-01, -1.69191051e+12,
        1.82554664e+12, -3.63964844e+00,  1.70278405e+13, -7.14004836e+10,
       -4.23196435e+11,  1.03149414e+01,  2.86914062e+00,  7.92298568e+12,
       -7.62136053e+12,  1.09713211e+12, -1.85516909e+12, -1.85516909e+12,
       -3.48610689e+12, -3.48610689e+12, -3.48610689e+12, -3.48610689e+12,
        5.99955663e+12, -6.19362029e+12,  8.77639435e+12,  5.99955663e+12,
        4.18246473e+12,  1.38666992e+01, -1.08948975e+01,  1.13957317e+12,
        4.87621733e+12,  1.11713958e+13, -9.45685342e+11,  3.50065012e+12,
       -1.56744281e+12, -2.93068987e+12,  3.63922039e+12, -3.57312876e+12,
        8.73206697e+11, -3.34891543e+11,  4.11144392e+12, -6.14447407e+12,
        1.49496417e+12,  8.92330455e+11, -2.93068987e+12, -1.69813861e+12,
       -1.69813861e+12, -2.21199221e+12,  8.73206697e+11, -2.34057749e+12,
       -2.00660006e+13,  2.12488172e+12,  2.12488172e+12, -6.98374136e+12,
       -1.69813861e+12, -2.34057749e+12, -3.55400500e+12,  8.92330455e+11,
        3.01497735e+12, -9.79440939e+11,  2.10575797e+12,  2.10575797e+12,
        1.51408793e+12,  2.87889265e+12,  2.40666913e+12, -1.56744281e+12,
        2.87889265e+12,  3.50065012e+12, -2.93068987e+12, -2.32145374e+12,
        2.12488172e+12,  8.92330455e+11, -3.57312876e+12,  8.73206697e+11,
        2.10575797e+12, -3.34891543e+11,  4.11144392e+12,  3.14109505e+11,
       -1.51143713e+12, -1.51143713e+12, -1.51143713e+12, -1.51143713e+12,
       -1.51143713e+12,  1.45884577e+12, -1.31276211e+10, -1.31276211e+10,
       -1.51143713e+12, -8.29625038e+12,  1.09182470e+12,  1.09182470e+12,
       -1.51143713e+12, -1.31276211e+10, -1.31276211e+10, -1.51143713e+12,
       -1.31276211e+10, -1.31276211e+10, -1.51143713e+12, -1.51143713e+12,
       -1.51143713e+12, -1.51143713e+12, -1.51143713e+12, -1.51143713e+12,
       -1.51143713e+12, -1.51143713e+12, -1.33469030e+13,  6.23168585e+12,
        7.87742463e+11, -8.88032597e+12,  0.00000000e+00,  6.85042710e+11,
        4.57832432e+12, -3.57115249e+12, -8.88032597e+12, -3.00901745e+11,
       -3.69509838e+12, -8.88032597e+12,  2.46354137e+12,  1.17849839e+13,
       -1.86968795e+12,  7.10427757e+12, -7.09520766e+13, -3.44096654e+12,
        2.27305971e+12, -2.15128473e+12, -7.63911980e+12,  4.57317102e+12,
       -4.29160004e+13, -1.66864267e+13,  1.43370849e+13,  1.07976313e+13,
        7.40946881e+13,  4.07400850e+12, -7.91311631e+12,  1.70082767e+13,
       -6.39044993e+12, -4.81197942e+12, -2.88888867e+12, -8.88032597e+12,
        4.16408475e+13,  1.92871745e+12,  1.35814843e+13,  2.79681485e+12,
       -8.88032597e+12, -1.43489167e+12, -5.86091475e+11, -7.42063733e+12,
       -1.68688888e+12, -6.30149238e+12, -9.79216829e+11, -9.79216829e+11,
       -9.79216829e+11, -9.79216829e+11, -9.79216829e+11, -9.79216829e+11,
       -9.79216829e+11,  4.05071270e+11,  4.05071270e+11,  4.05071270e+11,
        4.05071270e+11, -8.95267535e+11, -8.95267535e+11, -8.95267535e+11,
       -8.95267535e+11, -8.95267535e+11, -8.95267535e+11, -8.95267535e+11,
       -8.95267535e+11, -8.95267535e+11, -8.95267535e+11, -8.95267535e+11,
       -8.95267535e+11,  0.00000000e+00, -8.95267535e+11, -8.95267535e+11,
       -8.95267535e+11, -8.95267535e+11, -8.95267535e+11, -8.95267535e+11,
       -8.95267535e+11, -8.95267535e+11, -8.95267535e+11, -8.95267535e+11,
       -8.95267535e+11, -3.73840595e+11, -8.95267535e+11, -8.95267535e+11,
       -7.62774346e+11,  0.00000000e+00, -6.07372992e+11, -6.07372992e+11,
       -6.07372992e+11, -6.07372992e+11, -6.07372992e+11, -6.07372992e+11,
       -6.07372992e+11, -6.07372992e+11, -6.07372992e+11, -6.07372992e+11,
       -6.07372992e+11, -6.07372992e+11,  3.73733721e+12,  3.73733721e+12,
        3.73733721e+12,  3.73733721e+12,  3.73733721e+12,  3.73733721e+12,
        3.73733721e+12,  3.73733721e+12,  3.73733721e+12,  3.73733721e+12,
        3.73733721e+12,  3.73733721e+12,  3.73733721e+12,  3.73733721e+12,
        3.73733721e+12,  3.73733721e+12,  3.73733721e+12,  3.73733721e+12,
        3.73733721e+12,  3.73733721e+12,  3.73733721e+12,  3.73733721e+12,
        3.73733721e+12,  3.73733721e+12,  3.73733721e+12])
In [192]:
linreg.intercept_
Out[192]:
-8038426659044.718
In [193]:
Y_test.shape
Out[193]:
(1263,)
In [204]:
 #Making predictions
# make predictions on the testing set
Y_pred = linreg.predict(X_test)
In [196]:
Y_pred.shape
Out[196]:
(1263,)
In [205]:
# import libraries for metrics 
import numpy as np
from sklearn import metrics

# Model evaluation metrics for regression
#print('y-intercept             : ', linreg.intercept_)
#print('beta coefficients       : ', linreg.coef_)
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE                    : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE                     : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value                : ', metrics.r2_score(Y_test, Y_pred))
Mean Abs Error   MAE    :  86239746006.26898
Mean Sq  Error MSE      :  2.9354845283889697e+24
Root Mean Sq Error RMSE :  1713325575712.0332
MAPE                    :  95008746435.94705
MPE                     :  85209316837.01973
r2 value                :  -1.958829699419431e+22
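These astronomical coefficients and the hugely negative r2 are a symptom of perfect multicollinearity: get_dummies keeps every level of each categorical, and a full set of dummy columns sums to 1 in every row, which is collinear with the intercept, so the OLS solution is numerically unstable. A toy sketch (hypothetical data) showing the resulting rank deficiency:

```python
import numpy as np
import pandas as pd

cat = pd.Series(['a', 'b', 'a', 'c'])
dummies = pd.get_dummies(cat)  # one column per level, none dropped

# Design matrix: intercept column plus all dummy columns
design = np.column_stack([np.ones(4), dummies.to_numpy(dtype=float)])

# The dummies sum to the intercept column, so rank < number of columns
rank = np.linalg.matrix_rank(design)
print(rank, design.shape[1])  # 3 4
```

Passing drop_first=True to get_dummies, or using a regularized model, avoids this degeneracy.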

Let's try KNN for regression

In [206]:
from sklearn import neighbors
In [207]:
# Modelling 
clf = neighbors.KNeighborsRegressor()
clf.fit(X_train, Y_train)
Out[207]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')
In [208]:
Y_pred=clf.predict(X_test)
In [209]:
Y_pred
Out[209]:
array([ 83.958, 108.846, 109.786, ..., 116.314,  81.93 ,  93.062])
In [210]:
Y_test
Out[210]:
array([ 73.36, 117.59, 107.74, ..., 110.44,  75.41,  88.62])

KNN metrics

In [211]:
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept             : ', linreg.intercept_)
#print('beta coefficients       : ', linreg.coef_)
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE                    : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE                     : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value                : ', metrics.r2_score(Y_test, Y_pred))
Mean Abs Error   MAE    :  6.1405114806017425
Mean Sq  Error MSE      :  80.14530435154396
Root Mean Sq Error RMSE :  8.95239098518066
MAPE                    :  5.956790750464435
MPE                     :  -0.9069589578306889
r2 value                :  0.46519560599090515

Let's try a Decision Tree for regression

In [217]:
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
clf = clf.fit(X_train, Y_train)
In [218]:
Y_pred=clf.predict(X_test)
In [214]:
Y_pred
Out[214]:
array([ 98.81, 109.66, 107.66, ..., 116.36,  98.81,  95.24])
In [215]:
Y_test
Out[215]:
array([ 73.36, 117.59, 107.74, ..., 110.44,  75.41,  88.62])

Decision Tree metrics

In [219]:
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept             : ', linreg.intercept_)
#print('beta coefficients       : ', linreg.coef_)
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE                    : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE                     : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value                : ', metrics.r2_score(Y_test, Y_pred))
Mean Abs Error   MAE    :  7.576384006334125
Mean Sq  Error MSE      :  174.44319114445327
Root Mean Sq Error RMSE :  13.20769439169658
MAPE                    :  7.432946144283889
MPE                     :  -1.193877814069242
r2 value                :  -0.1640480485270599

Let's try a Random Forest for regression

In [130]:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf = clf.fit(X_train, Y_train)
In [131]:
Y_pred=clf.predict(X_test)
In [232]:
Y_pred
Out[232]:
array([ 85.13778333, 109.8769    , 109.612     , ..., 117.5199    ,
        89.5909    ,  95.0303    ])
In [229]:
Y_test
Out[229]:
array([ 73.36, 117.59, 107.74, ..., 110.44,  75.41,  88.62])

Random Forest metrics

In [233]:
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept             : ', linreg.intercept_)
#print('beta coefficients       : ', linreg.coef_)
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE                    : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE                     : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value                : ', metrics.r2_score(Y_test, Y_pred))
Mean Abs Error   MAE    :  5.612367513301778
Mean Sq  Error MSE      :  83.32863201151974
Root Mean Sq Error RMSE :  9.128451786120127
MAPE                    :  5.43357654940373
MPE                     :  -0.8660620180439603
r2 value                :  0.44395346792804136

Some tuning on Random Forest

With K-Folds cross-validator

Provides train/test indices to split data into train and test sets. The dataset is split into k consecutive folds (without shuffling by default). Each fold is then used once as the validation set while the remaining k - 1 folds form the training set.
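The splitting behaviour described above can be seen directly on toy data (n_splits=3 for brevity):

```python
import numpy as np
from sklearn.model_selection import KFold

# 6 samples, 3 folds: each consecutive pair of indices is held out once
X = np.arange(6).reshape(6, 1)
folds = [tuple(test) for _, test in KFold(n_splits=3).split(X)]
print(folds)  # [(0, 1), (2, 3), (4, 5)]
```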

In [3]:
from sklearn.model_selection import KFold
KF= KFold(n_splits=10)

Parameters of RF regressor

  1. n_estimators, int, default=100: the number of trees in the forest.
  2. min_samples_split, int or float, default=2: the minimum number of samples required to split an internal node.
  3. max_depth, int, default=None: the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.
  4. criterion, {"mse", "mae"}, default="mse": the function measuring split quality; "mse" (mean squared error) is equivalent to variance reduction as a feature-selection criterion, "mae" is the mean absolute error.
  5. max_features, {"auto", "sqrt", "log2"}, int or float, default="auto": the number of features to consider when looking for the best split.
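Rather than fixing these values by hand, they can be searched over. A minimal sketch using GridSearchCV on synthetic data (the grid values here are illustrative, not tuned for this dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real feature matrix
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=1.0, random_state=0)

# Exhaustively try each parameter combination with 3-fold CV
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [10, 50], 'max_depth': [5, 10]},
    cv=3,
    scoring='neg_mean_squared_error',
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```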
In [132]:
clf = RandomForestRegressor(n_estimators=50,min_samples_split=0.1, max_depth=10, criterion='mse', max_features='sqrt')
clf = clf.fit(X_train, Y_train)

Evaluate a score by cross-validation

  1. estimator: the model to fit; here we use the Random Forest.
  2. cv: int, cross-validation generator or iterable, default=None. Determines the cross-validation splitting strategy; we will use the 10-fold KF defined above.
  3. scoring: str or callable, default=None. A string (see the model-evaluation documentation) or a scorer callable with signature scorer(estimator, X, y) that returns a single value. Similar to cross_validate, but only a single metric is permitted.
In [259]:
from sklearn.model_selection import cross_val_score

KFresult=cross_val_score(estimator=clf,X=X,y=Y,cv=KF, scoring='neg_mean_squared_error')
print( 'Mean Squared Error : ', KFresult.mean())
Mean Squared Error :  -77.55809997635885

Feature elimination

In [295]:
# from F regression on numeric cols 
dropcols
# columns with only one value in them
dropcols2=newdf[newdf['fscore'].isnull()].Colname.values
# from ANOVA on categorical columns: X4 can be discarded
In [307]:
rawdata=pd.read_csv('./train/train.csv')
# dropping x4
rawdata = rawdata.drop(['X4'],axis=1)

#dropping dropcols columns (flagged by the F-regression)
rawdata = rawdata.drop(columns=list(dropcols))

#dropping dropcols2 columns (single-valued columns)
rawdata = rawdata.drop(columns=list(dropcols2))
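As an aside, the single-valued (zero-variance) columns gathered in dropcols2 can also be found directly with pandas. A minimal sketch on toy data (the column names and values here are hypothetical):

```python
import pandas as pd

# Toy stand-in for the raw training frame (hypothetical values)
df = pd.DataFrame({
    'X1': [1, 0, 1, 0],
    'X2': [0, 0, 0, 0],        # single-valued: zero variance
    'X3': [5, 5, 5, 5],        # single-valued: zero variance
    'y':  [1.2, 3.4, 2.2, 0.9],
})

# A column with one unique value carries no information for the model
zero_var = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=zero_var)
print(zero_var, list(df.columns))  # ['X2', 'X3'] ['X1', 'y']
```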

Re-trying machine learning after feature elimination

In [310]:
#One hot encoding

import pandas as pd
rawdata=pd.get_dummies(rawdata)

rawdata
Out[310]:
ID y X12 X13 X14 X16 X17 X19 X20 X21 ... X8_p X8_q X8_r X8_s X8_t X8_u X8_v X8_w X8_x X8_y
0 0 130.81 0 1 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
1 6 88.53 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 7 76.26 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
3 9 80.62 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 13 78.02 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4204 8405 107.39 0 0 1 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
4205 8406 108.77 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4206 8412 109.22 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4207 8415 87.48 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4208 8417 110.85 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0

4209 rows × 435 columns

In [311]:
# Collecting X and Y
X = rawdata.drop(['ID','y'],axis=1).values
Y = rawdata['y'].values
In [313]:
Y
Out[313]:
array([130.81,  88.53,  76.26, ..., 109.22,  87.48, 110.85])
In [46]:
# Splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, test_size=0.3)

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
(2946, 563)
(2946,)
(1263, 563)
(1263,)
In [66]:
## KNN for regression

from sklearn import neighbors


# Modelling 
clf = neighbors.KNeighborsRegressor(n_neighbors=17, metric='hamming', weights= 'distance')
clf.fit(X_train, Y_train)

Y_pred=clf.predict(X_test)

Params for KNN

  1. n_neighbors: int, default=5. Number of neighbors to use by default for kneighbors queries.
  2. weights: {‘uniform’, ‘distance’} or callable, default=’uniform’. Weight function used in prediction; uniform weights are used by default.
  3. algorithm: {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’. Algorithm used to compute the nearest neighbors.
  4. metric: str or callable, default=’minkowski’. The distance metric to use for the tree; the default minkowski metric with p=2 is equivalent to the standard Euclidean metric.
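The hamming metric chosen above is a reasonable fit for one-hot encoded rows: it measures the fraction of positions at which two binary vectors differ. A small illustration (the vectors are hypothetical):

```python
import numpy as np

# Two one-hot encoded feature rows (hypothetical)
a = np.array([1, 0, 0, 1, 0])
b = np.array([1, 1, 0, 0, 0])

# Hamming distance: fraction of differing positions (2 of 5 here)
hamming = np.mean(a != b)
print(hamming)  # 0.4
```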
In [67]:
## KNN metrics

# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept             : ', linreg.intercept_)
#print('beta coefficients       : ', linreg.coef_)
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE                    : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE                     : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value                : ', metrics.r2_score(Y_test, Y_pred))
Mean Abs Error   MAE    :  6.0120967701828905
Mean Sq  Error MSE      :  69.71454972762011
Root Mean Sq Error RMSE :  8.349523922213775
MAPE                    :  5.842483696558626
MPE                     :  -0.7464194805263897
r2 value                :  0.5334926312302448

With Kfold

In [68]:
from sklearn.model_selection import cross_val_score

KFresult=cross_val_score(estimator=clf,X=X,y=Y,cv=KF, scoring='neg_mean_squared_error')
print( 'Mean Squared Error : ', KFresult.mean())
Mean Squared Error :  -85.8781184879734
In [69]:
## Let's try Decision Tree for regression

from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
clf = clf.fit(X_train, Y_train)

Y_pred=clf.predict(X_test)
In [70]:
## Decision Tree metrics

# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept             : ', linreg.intercept_)
#print('beta coefficients       : ', linreg.coef_)
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE                    : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE                     : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value                : ', metrics.r2_score(Y_test, Y_pred))
Mean Abs Error   MAE    :  7.077850356294537
Mean Sq  Error MSE      :  110.57064731038695
Root Mean Sq Error RMSE :  10.515257833756952
MAPE                    :  6.895123333092348
MPE                     :  -0.5647868701924015
r2 value                :  0.2600967525219343
In [71]:
## Let's try Random Forest for regression

from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf = clf.fit(X_train, Y_train)

Y_pred=clf.predict(X_test)
In [ ]:
## Random Forest metrics

# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept             : ', linreg.intercept_)
#print('beta coefficients       : ', linreg.coef_)
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE                    : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE                     : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value                : ', metrics.r2_score(Y_test, Y_pred))
In [41]:
## tuning the Random Forest

from sklearn.model_selection import KFold
KF= KFold(n_splits=10)


clf = RandomForestRegressor(n_estimators=50,min_samples_split=0.1, max_depth=10, criterion='mse', max_features='sqrt')
clf = clf.fit(X_train, Y_train)


from sklearn.model_selection import cross_val_score

KFresult=cross_val_score(estimator=clf,X=X,y=Y,cv=KF, scoring='neg_mean_squared_error')
print( 'Mean Squared Error : ', KFresult.mean())
Mean Squared Error :  -77.13722907425915
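Instead of hand-picking n_estimators, max_depth, etc., the search could be automated with GridSearchCV. A minimal sketch on synthetic data (the grid values here are illustrative, not the ones used above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for the encoded data
X_demo, y_demo = make_regression(n_samples=120, n_features=10,
                                 noise=5.0, random_state=0)

# Illustrative grid around the hand-tuned values
grid = {'n_estimators': [25, 50], 'max_depth': [5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      cv=3, scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)
print(search.best_params_)
```

search.best_estimator_ is then a refit model using the best parameter combination found by cross-validation.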

Trying Gradient Boosting

Params

  1. n_estimators: int, default=100. The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.
  2. min_samples_split: int or float, default=2. The minimum number of samples required to split an internal node.
  3. learning_rate: float, default=0.1. Shrinks the contribution of each tree by learning_rate; there is a trade-off between learning_rate and n_estimators.
  4. loss: {‘ls’, ‘lad’, ‘huber’, ‘quantile’}, default=’ls’. The loss function to be optimized. ‘ls’ refers to least squares regression; ‘lad’ (least absolute deviation) is a highly robust loss based solely on the order information of the input variables; ‘huber’ is a combination of the two; ‘quantile’ allows quantile regression.
  5. criterion: {‘friedman_mse’, ‘mse’, ‘mae’}, default=’friedman_mse’. The function to measure the quality of a split: “friedman_mse” (mean squared error with Friedman's improvement score), “mse” (mean squared error), and “mae” (mean absolute error). The default “friedman_mse” is generally the best, as it can provide a better approximation in some cases.
In [31]:
# Fit regression model
from sklearn.ensemble import GradientBoostingRegressor

params = {'n_estimators': 1500, 
          'max_depth': 4, 
          'min_samples_split': 2,
          'learning_rate': 0.005, 
          'loss': 'ls'}

gbr = GradientBoostingRegressor(**params)
In [32]:
# Train GB regressor
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, test_size=0.3)

gbr.fit(X_train, Y_train)
Out[32]:
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.005, loss='ls',
                          max_depth=4, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=1500,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
In [45]:
from sklearn.model_selection import cross_val_score

KFresult=cross_val_score(estimator=gbr,X=X,y=Y,cv=KF, scoring='neg_mean_squared_error')
print( 'Mean Squared Error : ', KFresult.mean())
Mean Squared Error :  -71.44837120067895
In [ ]:
# Tuning Gradient Boost
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
KF=KFold(n_splits=5, random_state=20)

scor={'r2':'r2', 'MSE':'neg_mean_squared_error'}

scores=cross_validate(estimator=gbr,X=X_train,y=Y_train,cv=KF,scoring=scor,return_train_score=True)

#scores.keys()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:296: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
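The FutureWarning above can be silenced by passing shuffle=True, which is the only case in which random_state has any effect:

```python
from sklearn.model_selection import KFold

# random_state only matters when shuffle=True; this keeps the folds reproducible
KF = KFold(n_splits=5, shuffle=True, random_state=20)
print(KF.get_n_splits())  # 5
```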
In [46]:
print('Train MSE')
print(scores['train_MSE'].mean())
print('Train R2')
print(scores['train_r2'].mean())
print('-------------vs---------------')
print('Test MSE')
print(scores['test_MSE'].mean())
print('Test R2')
print(scores['test_r2'].mean())
Train MSE
-48.52062078593129
Train R2
0.7068512147562658
-------vs----------
Test MSE
-77.05777649444737
Test R2
0.5420662196735713

Trying XGBoost

In [47]:
pip install xgboost
Requirement already satisfied: xgboost in c:\programdata\anaconda3\lib\site-packages (1.0.2)
Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from xgboost) (1.18.1)
Requirement already satisfied: scipy in c:\users\dell\appdata\roaming\python\python37\site-packages (from xgboost) (1.4.1)
Note: you may need to restart the kernel to use updated packages.
In [72]:
# train test split
X_train, X_test, Y_train, Y_test = train_test_split(X, 
                                                    Y, 
                                                    test_size=0.2, 
                                                    random_state=123)
In [87]:
X_test
Out[87]:
array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
In [73]:
import xgboost as xgb
train = xgb.DMatrix(X_train,Y_train)
test  = xgb.DMatrix(X_test, Y_test)
In [97]:
# parameters for tuning
params={'max_depth': 7,
 'min_child_weight': 2,
 'eta': 0.005,
 'subsample': 0.8,
 'colsample_bytree': 1,
 'objective': 'reg:linear',
 'eval_metric': 'mae'}
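Note that 'reg:linear' has been deprecated since XGBoost 1.0 (the training log below warns about this); an equivalent, warning-free parameter set simply renames the objective:

```python
# Same tuning parameters with the non-deprecated objective name:
# 'reg:linear' was renamed to 'reg:squarederror' in XGBoost 1.0
params = {'max_depth': 7,
          'min_child_weight': 2,
          'eta': 0.005,
          'subsample': 0.8,
          'colsample_bytree': 1,
          'objective': 'reg:squarederror',
          'eval_metric': 'mae'}
print(params['objective'])  # reg:squarederror
```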
In [98]:
num_boost_round = 999
In [100]:
%%time
model = xgb.train(
                params,
                train,
    num_boost_round=num_boost_round,
                evals=[(test, "Test")],
                early_stopping_rounds=10
)
[09:56:54] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.0.0/src/objective/regression_obj.cu:167: reg:linear is now deprecated in favor of reg:squarederror.
[0]	Test-mae:99.55287
Will train until Test-mae hasn't improved in 10 rounds.
[1]	Test-mae:99.05477
[2]	Test-mae:98.55984
[3]	Test-mae:98.06717
... (rounds 4-773 trimmed; Test-mae falls steadily and bottoms out around round 766) ...
[774]	Test-mae:4.78931
[775]	Test-mae:4.78929
[776]	Test-mae:4.78923
Stopping. Best iteration:
[766]	Test-mae:4.78877

Wall time: 1min 12s
In [101]:
# Predict
from sklearn import metrics
Y_pred = model.predict(train)

print("Training : metrics ...")
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_train, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_train, Y_pred))

print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_train, Y_pred)))

print('r2 value                : ', metrics.r2_score(Y_train, Y_pred))

Y_pred = model.predict(test)

print('\n')
print("Testing : metrics ...")
print('Mean Abs Error   MAE    : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq  Error MSE      : ', metrics.mean_squared_error(Y_test, Y_pred))

print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

print('r2 value                : ', metrics.r2_score(Y_test, Y_pred))
Training : metrics ...
Mean Abs Error   MAE    :  4.455029229440851
Mean Sq  Error MSE      :  58.8515100151492
Root Mean Sq Error RMSE :  7.671473783775136
r2 value                :  0.6401563599839779


Testing : metrics ...
Mean Abs Error   MAE    :  4.789231022270728
Mean Sq  Error MSE      :  58.27025971628166
Root Mean Sq Error RMSE :  7.6334959039932455
r2 value                :  0.6100741431454297

An R² of 0.64 with an MSE of 58 on training vs. an R² of 0.61 with an MSE of 58 on testing looks reasonable. Increasing the max depth to a higher number (e.g. 20) gives better results on the testing set, but the training score is poor. At a max depth of 7, combined with the other tuning parameters, this trade-off gives a decent result.
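The project brief also asks for predictions on test_df. Before the trained booster can score it, the one-hot columns of the test frame must be aligned with the training matrix, since get_dummies can produce different columns on each set. A sketch of the alignment step with pandas (the frames and column names are hypothetical):

```python
import pandas as pd

# Hypothetical frames after get_dummies: the test set is missing X0_b
# and has an extra X0_c that training never saw
train_X = pd.DataFrame({'X0_a': [1, 0], 'X0_b': [0, 1], 'X12': [0, 1]})
test_X  = pd.DataFrame({'X0_a': [0, 1], 'X0_c': [1, 0], 'X12': [1, 0]})

# Keep exactly the training columns, filling unseen ones with 0
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(list(test_X.columns))  # ['X0_a', 'X0_b', 'X12']
```

The aligned frame can then be wrapped as xgb.DMatrix(test_X.values) and passed to model.predict, mirroring the workflow above.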

In [ ]: